This is the answers of the exercise by Junmar Sales II

Email:

Dated 25 September 2020.

You can also see this on my github pages: MEE 621 - Activity 2 on Data Transformation

This report on the answers of the exercises outline Activity 2 on Data transformation all done in R Studio.

Let’s load the prerequisite libraries, we have assumes that packages has been installed.

library("nycflights13")
library("tidyverse")
## -- Attaching packages ---------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Checking data

?flights
## starting httpd help server ... done

5.2.4 Exercises

1. Find all flights that

1. Had an arrival delay of two or more hours

In order to find flights with an arrival delay of two or more ours, we should covert hours into minutes Since the arr_delay variable is measured in minutes. Therefore, the check becomes 120 or more minutes.

filter(flights, arr_delay >= 120)

2. Flew to Houston ( IAH or HOU )

We are looking into those flights where the destination (dest) is either “IAH” or “HOU”.

THere are two ways to filter it. It shows same answer but the first is more concise.

First:

filter(flights,dest %in% c("IAH", "HOU"))

Second:

filter(flights, dest == "IAH" | dest == "HOU")

3. Were operated by United, American, or Delta

Let’s check first what are the airline code for United, American or Delta.

airlines

We can now see that UA is for United Airlines, AA is for American Airlines, and DL is for Delta.

filter(flights, carrier %in% c("AA", "DL", "UA"))

4. Departed in summer (July, August, and September)

MOnths representation are in number. July is 7, August is 8 and soon.

filter(flights, month >= 7, month <= 9)

5. Arrived more than two hours late, but didn’t leave late

using some mathematical operators, we can run:

filter(flights, arr_delay > 120, dep_delay <= 0)

6. Were delayed by at least an hour, but made up over 30 minutes in flight

Let’s analyze:

If a flight was delayed by at least an hour, then dep_delay >= 60. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning dep_delay == arr_delay, or alternatively, dep_delay - arr_delay == 0. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as dep_delay - arr_delay > 30.

filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)

7. Departed between midnight and 6am (inclusive)

How does midnight is depicted here? This is a riddle so we need to check summary.

summary(flights$dep_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     907    1401    1349    1744    2400    8255

Viola midnight is 2400, not 0 as first thought.

filter(flights, dep_time <= 600 | dep_time == 2400)

2. Another useful dplyr filtering helper is between() . What does it do? Can you use it to simplify the code needed to answer the previous challenges?

filter(flights, between(month, 7, 9))

I noticed that the expression between(x, left, right) is equivalent to x >= left & x <= right.

3. How many flights have a missing dep_time ? What other variables are missing? What might these rows represent?

Find the rows of flights with a missing departure time (dep_time) using the is.na() function.

filter(flights, is.na(dep_time))

IT looks like these missing departure times are those canceled flights.

4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? ( NA * 0 is a tricky counterexample!)

Let’s check:

NA ^ 0
## [1] 1

NA ^ 0 == 1 since for all numeric values x0=1 .

NA | TRUE
## [1] TRUE

NA | TRUE is TRUE because anything or TRUE is TRUE. By using OR operator, if the missing value were TRUE, then TRUE | TRUE == TRUE, and if the missing value was FALSE, then FALSE | TRUE == TRUE.

NA & FALSE
## [1] FALSE

The value of NA & FALSE is FALSE because anything and FALSE is always FALSE. If the missing value were TRUE, then TRUE & FALSE == FALSE, and if the missing value was FALSE, then FALSE & FALSE == FALSE.

NA | FALSE
## [1] NA

For NA | FALSE, the value is unknown since TRUE | FALSE == TRUE, but FALSE | FALSE == FALSE.

NA & FALSE
## [1] FALSE

For NA & TRUE, the value is unknown since FALSE & TRUE== FALSE, but TRUE & TRUE == TRUE.

NA * 0
## [1] NA

Since x∗0=0 for all finite numbers we might expect NA * 0 == 0, but that’s not the case. The reason that NA * 0 != 0 is that 0×∞ and 0×−∞

are undefined. R represents undefined results as NaN, which is an abbreviation of “not a number”.

Inf * 0
## [1] NaN
-Inf * 0
## [1] NaN

5.3.1 Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na() )

Using arrange() in default call,

arrange(flights, dep_time)

Let’s look where are the NAs, it looks like they are the end of the list.

arrange(flights, dep_time) %>%
  tail()

So let;s use is.na() to put all these NAs at the start of list for sorting.

arrange(flights, desc(is.na(dep_time)), dep_time)

The flights will first be sorted by desc(is.na(dep_time)) because desc(is.na(dep_time)) function is either TRUE when dep_time is missing, or FALSE, when it is not, the rows with missing values of dep_time will come first, since TRUE > FALSE.

2. Sort flights to find the most delayed flights. Find the flights that left earliest.

To find the most delayed flights, we will sort the table using departure delay, dep_delay, in descending order.

arrange(flights, desc(dep_delay))

The most delayed flights were scheduled to leave on January 09, 2013 09:00. Noticed that the departure time is given as 641, which is 641AM the next day, so around 21 hours and 41 minutes delayed. Curious to see what airlines are the most delayed - it was HA 51, JFK to HNL,

For the earliest departing flight, it can be found by sorting dep_delay in ascending order.

arrange(flights, dep_delay)

The earliest flights were noted to depart December 7, 2013 with dep_time 2040, a 43 minutes early than its scheduled dept time of 2123.

3. Sort flights to find the fastest (highest speed) flights.

I think to have the fastest flights, it should have the shortest air_time. Adding head() is to ensure it will be visible.

head(arrange(flights, (air_time)))

Fastest flights here are those with 20 minute flights fromEWR to BDL.

Or, we can compute to get the ground speed by dividing distance over air_time.

arrange(flights, desc(distance/air_time))

Based on ground speed, DL 1499 from LGA to ATL is the fastest.

4. Which flights travelled the farthest? Which travelled the shortest?

Sort the flights by the distance column in descending order to get the farthest flight.

arrange(flights, desc(distance))

In the data, it looks like the flight going to HNL from JFK is the farthest.

arrange(flights, air_time)

Flight UV 4368 from EWR and BDL is the shortest.

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time , dep_delay , arr_time ,and arr_delay from flights .

In selecting columns, I think we have few ways to do it.

  1. Specify column names as strings.
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
  1. Specify the column numbers of the variables.
select(flights, 4, 6, 7, 9)

This works, but is not good practice for two reasons. First, the column location of variables may change, resulting in code that may continue to run without error, but produce the wrong answer. Second code is obfuscated, since it is not clear from the code which variables are being selected. What variable does column 6 correspond to? I just wrote the code, and I’ve already forgotten.

  1. Specify columns names as unquoted variable names.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
  1. Specify the names of the variables with character vector and any_of() or all_of()
select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

This is useful because the names of the variables can be stored in a variable and passed to all_of() or any_of().

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(variables))

These two functions replace the deprecated function one_of().

Selecting the variables by matching the start of their names using starts_with().

select(flights, starts_with("dep_"), starts_with("arr_"))

Selecting the variables using regular expressions with matches(). Regular expressions provide a flexible way to match string patterns and are discussed in the Strings chapter.

select(flights, matches("^(dep|arr)_(time|delay)$"))

Specify the names of the variables with a character vector and use the bang-bang operator (!!).

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!variables)

This and the following answers use the features of tidy evaluation not covered in R4DS but covered in the Programming with dplyr vignette.

Specify the names of the variables in a character or list vector and use the bang-bang-bang operator.

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!!variables)

Specify the unquoted names of the variables in a list using syms() and use the bang-bang-bang operator.

variables <- syms(c("dep_time", "dep_delay", "arr_time", "arr_delay"))
select(flights, !!!variables)

There are Some things that don’t work:

Matching the ends of their names using ends_with() since this will incorrectly include other variables. For example,

select(flights, ends_with("arr_time"), ends_with("dep_time"))

Matching the names using contains() since there is not a pattern that can include all these variables without incorrectly including others.

select(flights, contains("_time"), contains("arr_"))

2. What happens if you include the name of a variable multiple times in a select() call?

select(flights, dep_time, arr_time, carrier, dep_time, dep_time)

It looks like it ignores the duplicate variables in displaying. We need not see any error or warnings. Just it let it pass.

select(flights, everything())

This behavior is useful because it means that we can use select() with everything() in order to easily change the order of columns without having to specify the names of all the columns.

3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

Per documentation,

The one_of() function selects variables with a character vector rather than unquoted variable name arguments. This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.

select(flights, one_of(vars))

There are also other similar functions: all_of() and any_of()

select(flights, all_of(vars))
select(flights, any_of(vars))

These functions differ in their strictness. The function all_of() will raise an error if one of the variable names is not present, while any_of() will ignore it.

The select() function can now accept the name of a vector containing the variable names you wish to select:

select(flights, vars)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(vars)` instead of `vars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

However there is a problem with the previous code. The name vars could refer to a column named vars in flights or a different variable named vars. What th code does will depend on whether or not vars is a column in flights. If vars was a column in flights, then that code would only select the vars column. For example:

flights <- mutate(flights, vars = 1)
select(flights, !!!vars)

This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.

4. Does the result of running the following code surprise you? How do the select helpersdeal with case by default? How can you change that default?

select(flights, contains("TIME"))